Loading the libraries required for this analysis
# install.packages("corrplot")
# install.packages("PerformanceAnalytics")
library(tidyverse)
library(kableExtra)
library(lmtest)
library(MASS)
The Living Costs and Food Survey (LCF) is an annual survey carried out in the United Kingdom by the Office for National Statistics since 1957. It collects data on the spending patterns and the cost of living of households across the UK.
The LCF sample for Great Britain is a multi-stage stratified random sample with clustering. Addresses with ‘small user’ postcodes are drawn from the postcode address file. The LCF sample for Northern Ireland is collected by the Central Survey Unit of the Northern Ireland Statistics and Research Agency (NISRA). A systematic random sample of private addresses is drawn from the Land and Property Services property database.
The LCF is a continuous survey, collected throughout the year so that seasonal effects are covered.
About 11,000 private households are randomly selected each year.
Since participation is entirely voluntary, households can choose not to respond to the survey.
On average, about 50% of the 11,000 selected households respond each year.
A volunteering household needs to fill in a household questionnaire and an individual questionnaire, and all individuals aged 16 and over keep a diary tracking their daily expenditure for two weeks.
The LCF provides information for the Retail Prices Index, National Accounts estimates of household expenditure, the analysis of the effect of taxes and benefits and trends in nutrition. The results, however, are multi-purpose, providing an invaluable supply of economic and social data.
The dataset 1 we are analysing in this project is a teaching dataset: a simplified subset of the 2013 LCF survey prepared for learning and teaching. It has been anonymised and deposited with the UK Data Service and can be found here.
The 2013 LCF survey had 5,144 respondents, of which 151 were from Northern Ireland. Each row in the dataset contains the observations from one household.
Household spending is the amount of final consumption expenditure made by resident households to meet their everyday needs, such as food, clothing, housing (rent), energy, transport, durable goods (notably cars), health costs, leisure, and miscellaneous services. It is typically around 60% of gross domestic product (GDP) and is therefore an essential variable for economic analysis of demand 2.
Economists have traditionally relied on reported household income and expenditure as the preferred indicators of poverty and living standards, but the use of such indicators can be problematic. Their measurement usually requires lengthy modules and detailed questions that are not practical for households with other priorities, and the resulting data can contain errors or reporting biases. 3
As part of this study, we would like to infer the household expenditure based on income and other socioeconomic indicators of the household. The model built will be tested to determine how well it performs at predicting the expenditure of the household.
The variables in the original dataset are
| Variable name | Variable label | Variable type |
|---|---|---|
| casenew | Randomly generated case number | Scalar |
| weighta | Annual weight | Scalar |
| P550tpr | Total expenditure, by adults & children (top-coded) | Scalar |
| P344pr | Gross normal weekly household income (top-coded) | Scalar |
| P425r | Main source of household income | Nominal |
| A172 | Internet connection in household | Nominal |
| A093r | Economic position of Household Reference Person | Nominal |
| A094r | NS-SEC 3 class of Household Reference Person | Nominal |
| A121r | Tenure type | Nominal |
| SexHRP | Sex of Household Reference Person | Nominal |
| A049r | Number of persons in household | Ordinal |
| G018r | Number of adults in household | Ordinal |
| G019r | Number of children in household | Ordinal |
| Gorx | Government Office Region - modified | Nominal |
| weightar | Weight (rescaled) | Scalar |
| maininc | Main source of household income (recoded, P425-1) | Nominal |
| income | Income | Scalar |
| expenditure | Total expenditure (top coded, formerly P550tpr) | Scalar |
| hhsize | Household size, number of people in household (recoded, formerly A049r) | Nominal |
Loading data from the tab-delimited file:
lcf_data_raw = read.table("icfforworkbook.tab", sep = "\t", header = TRUE)
colnames(lcf_data_raw)[colnames(lcf_data_raw)=="G018r"] = "noAdults"
colnames(lcf_data_raw)[colnames(lcf_data_raw)=="G019r"] = "noChildren"
colnames(lcf_data_raw)[colnames(lcf_data_raw)=="A049r"] = "noPersons"
# remove rows that have zero income
lcf_data_raw = lcf_data_raw[lcf_data_raw$income > 0,]
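As a minimal sketch (toy data, not from the survey) of what this row filter does:

```r
# Toy data frame illustrating the positive-income filter used above
toy = data.frame(income = c(0, 120.5, 300.0),
                 expenditure = c(50, 80, 210))
toy_pos = toy[toy$income > 0, ]  # keep only rows with positive income
nrow(toy_pos)  # 2
```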
# Converting to factor variable
# Income Source
# 1 = EarnedIncome
# 2 = OtherIncome
lcf_data_raw$incomeSource = ifelse(lcf_data_raw$P425r==1,"EarnedIncome","OtherIncome")
lcf_data_raw$incomeSource = as.factor(lcf_data_raw$incomeSource)
# Internet Connection in household
# 1 - Yes
# 2 - No
lcf_data_raw$internet = ifelse(lcf_data_raw$A172==1, "Yes", "No")
lcf_data_raw$internet = as.factor(lcf_data_raw$internet)
# Economic position of Household Reference Person
# 1 - Full-time working
# 2 - Part-time working
# 3 - Unemployed and work related Government Training Programmes
# 4 - Economically inactive
lcf_data_raw$economicHRP = ifelse(lcf_data_raw$A093r==1, "FullTime",
                           ifelse(lcf_data_raw$A093r==2, "PartTime",
                           ifelse(lcf_data_raw$A093r==3, "Unemployed", "EconomicallyInactive")))
lcf_data_raw$economicHRP = as.factor(lcf_data_raw$economicHRP)
# Class of Household Reference Person
# 1 - Higher managerial, administrative and professional occupations
# 2 - Intermediate occupations
# 3 - Routine and manual occupations
# 4 - Never worked and long term unemployed, students and occupation not stated
# 5 - Not classified for other reasons
lcf_data_raw$SEC3Class = ifelse(lcf_data_raw$A094r==1, "Class1",
                         ifelse(lcf_data_raw$A094r==2, "Class2",
                         ifelse(lcf_data_raw$A094r==3, "Class3",
                         ifelse(lcf_data_raw$A094r==4, "Class 4", "Class 5"))))
lcf_data_raw$SEC3Class = as.factor(lcf_data_raw$SEC3Class)
# Tenure Type
# 1 - Public Rented
# 2 - Private Rented
# 3 - Owned
lcf_data_raw$tenureType = ifelse(lcf_data_raw$A121r==1, "PublicRented", ifelse(lcf_data_raw$A121r==2, "PrivateRented", "Owned"))
lcf_data_raw$tenureType = as.factor(lcf_data_raw$tenureType)
# Sex of household reference person
# 1 - Male
# 2 - Female
lcf_data_raw$sex_HRP = ifelse(lcf_data_raw$SexHRP==1, "Male", "Female")
lcf_data_raw$sex_HRP = as.factor(lcf_data_raw$sex_HRP)
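The same `ifelse` + `as.factor` pattern underlies every recode above; a quick toy check (codes mirror the 1 = Male, 2 = Female mapping documented above):

```r
# Toy vector of raw codes: 1 = Male, 2 = Female
raw_codes = c(1, 2, 2, 1, 1)
sex = as.factor(ifelse(raw_codes == 1, "Male", "Female"))
levels(sex)         # "Female" "Male" - levels are sorted alphabetically
sum(sex == "Male")  # 3
```

Because factor levels are sorted alphabetically, "Female" becomes the reference level, which is why the regression output later reports a `sex_HRPMale` coefficient.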
# Number of children in the household
#lcf_data_raw$NoChildren = ifelse(lcf_data_raw$G019r==1, "None", ifelse(lcf_data_raw$G019r==2, "OneChild", "TwoOrMore"))
#lcf_data_raw$NoChildren = as.factor(lcf_data_raw$NoChildren)
# select columns
cols = c("noAdults", "noPersons", "noChildren", "income", "expenditure", "hhsize", "incomeSource", "internet", "economicHRP", "SEC3Class", "tenureType", "sex_HRP")
lcf_data = lcf_data_raw[cols]
# expenditure is top coded so removing those values
#lcf_data = filter(lcf_data,expenditure!=max(lcf_data$expenditure))
Sample data:
head(lcf_data) %>%
kable() %>% kable_styling(c("striped","bordered"))
| noAdults | noPersons | noChildren | income | expenditure | hhsize | incomeSource | internet | economicHRP | SEC3Class | tenureType | sex_HRP |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 2 | 4 | 3 | 465.36 | 380.6958 | 4 | EarnedIncome | Yes | EconomicallyInactive | Class3 | PublicRented | Female |
| 2 | 2 | 1 | 855.26 | 546.4134 | 2 | EarnedIncome | Yes | FullTime | Class 4 | Owned | Female |
| 1 | 1 | 1 | 160.96 | 242.1890 | 1 | EarnedIncome | Yes | FullTime | Class2 | Owned | Female |
| 2 | 2 | 1 | 656.22 | 421.3824 | 2 | EarnedIncome | Yes | FullTime | Class3 | Owned | Male |
| 1 | 1 | 1 | 398.80 | 370.4056 | 1 | EarnedIncome | Yes | FullTime | Class 4 | Owned | Male |
| 1 | 1 | 1 | 321.02 | 172.3972 | 1 | EarnedIncome | No | FullTime | Class3 | PublicRented | Male |
The structure of the dataframe is:
str(lcf_data)
## 'data.frame': 5132 obs. of 12 variables:
## $ noAdults : int 2 2 1 2 1 1 2 4 2 2 ...
## $ noPersons : int 4 2 1 2 1 1 2 5 2 2 ...
## $ noChildren : int 3 1 1 1 1 1 1 1 1 1 ...
## $ income : num 465 855 161 656 399 ...
## $ expenditure : num 381 546 242 421 370 ...
## $ hhsize : int 4 2 1 2 1 1 2 5 2 2 ...
## $ incomeSource: Factor w/ 2 levels "EarnedIncome",..: 1 1 1 1 1 1 2 1 1 2 ...
## $ internet : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 1 2 2 2 2 ...
## $ economicHRP : Factor w/ 4 levels "EconomicallyInactive",..: 1 2 2 2 2 2 1 1 2 1 ...
## $ SEC3Class : Factor w/ 5 levels "Class 4","Class 5",..: 5 1 4 5 1 5 2 2 4 2 ...
## $ tenureType : Factor w/ 3 levels "Owned","PrivateRented",..: 3 1 1 1 1 3 1 3 1 1 ...
## $ sex_HRP : Factor w/ 2 levels "Female","Male": 1 1 1 2 2 2 2 1 2 2 ...
The following functions will be used to evaluate the models.
get_bp_decision = function(model, alpha = 0.01) {
decide = unname(bptest(model)$p.value < alpha)
ifelse(decide, "Reject", "Fail to Reject")
}
get_sw_decision = function(model, alpha = 0.01) {
decide = unname(shapiro.test(resid(model))$p.value < alpha)
ifelse(decide, "Reject", "Fail to Reject")
}
get_loocv_rmse = function(model) {
sqrt(mean((resid(model) / (1 - hatvalues(model))) ^ 2))
}
get_rmse = function(model) {
  sqrt(mean(model$residuals ^ 2))
}
get_adj_r2 = function(model) {
summary(model)$adj.r.squared
}
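The `get_loocv_rmse` function above avoids refitting the model \(n\) times; it uses the standard leave-one-out identity based on the leverage values:

\[\text{RMSE}_{\text{LOOCV}} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\frac{e_i}{1-h_{ii}}\right)^2}\]

where \(e_i\) is the \(i\)-th residual and \(h_{ii}\) is the \(i\)-th diagonal entry of the hat matrix, as returned by `hatvalues()`.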
# This function takes the model as the input and test
# a) Normality
# b) Linearity
# c) Constant variance
assumption_test = function(model){
par(mfrow=c(1,2))
# Fitted vs residual plot
plot(fitted.values(model),resid(model),col="dodgerblue", xlab = "Fitted", ylab = "Residuals", main = "Fitted vs Residuals")
abline(h=0,col="red",lwd=2)
# QQ plot to test normality
qqnorm(resid(model),col="dodgerblue")
qqline(resid(model),lwd=2,col="red")
}
First we look for correlations between the numeric variables; this will help us remove highly correlated variables.
Figure 1 - Correlation and Scatterplot of the variables of the data set
The correlations shown in Figure 1 reveal a strong correlation between hhsize and noPersons; referring to the documentation confirms that they capture the same data:
hhsize = noPersons
noPersons = noChildren + noAdults
We will therefore remove noPersons, noAdults and noChildren from the dataset.
# Deleting the columns
lcf_data$noPersons = NULL
lcf_data$noAdults = NULL
lcf_data$noChildren = NULL
Let’s do a correlation plot one more time:
PerformanceAnalytics::chart.Correlation(select_if(lcf_data,is.numeric), histogram = TRUE, pch = 19)
Let’s create an additive model first:
model_add_all = lm(expenditure ~ ., data = lcf_data)
summary(model_add_all)
##
## Call:
## lm(formula = expenditure ~ ., data = lcf_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -570.77 -120.27 -29.57 87.19 944.18
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 30.10917 22.34003 1.348 0.17779
## income 0.45755 0.01196 38.256 < 2e-16 ***
## hhsize 40.28775 2.76118 14.591 < 2e-16 ***
## incomeSourceOtherIncome 18.52177 10.21878 1.813 0.06996 .
## internetYes 66.91895 8.30417 8.058 9.54e-16 ***
## economicHRPFullTime -26.02638 14.11971 -1.843 0.06535 .
## economicHRPPartTime 5.80551 14.47181 0.401 0.68832
## economicHRPUnemployed -34.02913 19.15582 -1.776 0.07572 .
## SEC3ClassClass 5 -6.79073 17.97521 -0.378 0.70561
## SEC3ClassClass1 51.29857 17.79294 2.883 0.00395 **
## SEC3ClassClass2 28.83109 18.31898 1.574 0.11559
## SEC3ClassClass3 -21.27501 17.45756 -1.219 0.22303
## tenureTypePrivateRented 22.02845 8.22774 2.677 0.00744 **
## tenureTypePublicRented -44.78014 8.35520 -5.360 8.71e-08 ***
## sex_HRPMale 19.49748 6.09078 3.201 0.00138 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 197.9 on 5117 degrees of freedom
## Multiple R-squared: 0.5431, Adjusted R-squared: 0.5419
## F-statistic: 434.5 on 14 and 5117 DF, p-value: < 2.2e-16
Selecting only the important variables from the model above. A smaller model is easier to interpret.
model_add= lm(formula = expenditure ~ income + hhsize +internet, data = lcf_data)
summary(model_add)
##
## Call:
## lm(formula = expenditure ~ income + hhsize + internet, data = lcf_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -575.95 -123.20 -31.76 83.95 941.11
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 30.021354 7.722977 3.887 0.000103 ***
## income 0.491726 0.009211 53.386 < 2e-16 ***
## hhsize 35.427781 2.667127 13.283 < 2e-16 ***
## internetYes 74.925596 8.150891 9.192 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 200.6 on 5128 degrees of freedom
## Multiple R-squared: 0.5299, Adjusted R-squared: 0.5296
## F-statistic: 1927 on 3 and 5128 DF, p-value: < 2.2e-16
With just the important variables, \(R^2\) did not decrease much, and this simpler additive model will help us understand how expenditure is explained by the other factors.
assumption_test(model_add)
Figure 3 - Residuals vs Fitted and Normal QQ Plot for the model
To infer from the model, the assumptions of the linear model (LINE) should be met:
L - the response is a Linear combination of the predictors
I - the errors are Independent
N - the errors follow a Normal distribution
E - the error variance is Equal (constant) at any set of predictor values
From the Fitted vs Residuals plot, constant variance is suspect; from the QQ plot, normality is suspect.
We need to transform the response so that these assumptions are not violated. First, let’s try a log transformation:
# Log transformation of response variable
model_log = lm(log(expenditure)~ income + hhsize + internet, data = lcf_data)
summary(model_log)
##
## Call:
## lm(formula = log(expenditure) ~ income + hhsize + internet, data = lcf_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.61228 -0.27563 0.00905 0.27172 1.62147
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.786e+00 1.698e-02 281.84 <2e-16 ***
## income 1.096e-03 2.025e-05 54.11 <2e-16 ***
## hhsize 9.043e-02 5.864e-03 15.42 <2e-16 ***
## internetYes 3.517e-01 1.792e-02 19.63 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.441 on 5128 degrees of freedom
## Multiple R-squared: 0.5858, Adjusted R-squared: 0.5855
## F-statistic: 2417 on 3 and 5128 DF, p-value: < 2.2e-16
# We will leverage the function we created earlier to test whether the linear model assumptions are violated
assumption_test(model_log)
Figure 4 - Residuals vs Fitted and Normal QQ Plot for the log transformed model
The plots above show some improvement, but the assumptions are still suspect, so we will try a Box-Cox transformation.
The Box-Cox method considers a family of transformations on strictly positive response variables, \[ g_\lambda(y) = \left\{ \begin{array}{lr}\displaystyle\frac{y^\lambda - 1}{\lambda} & \lambda \neq 0\\ & \\ \log(y) & \lambda = 0 \end{array} \right. \]
The \(\lambda\) parameter is chosen by numerically maximising the log-likelihood.
model =lm(expenditure ~ income + hhsize + internet, data = lcf_data)
bc=boxcox(model,lambda = seq(-3,3))
Figure 5 - BoxCox for the simple model
(lambda = bc$x[which.max(bc$y)])
## [1] 0.2727273
model_bc = lm((expenditure^lambda) ~ ( income + hhsize + internet), data = lcf_data)
summary(model_bc)
##
## Call:
## lm(formula = (expenditure^lambda) ~ (income + hhsize + internet),
## data = lcf_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.05082 -0.39991 -0.02937 0.37376 2.43820
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.605e+00 2.288e-02 157.58 <2e-16 ***
## income 1.533e-03 2.728e-05 56.19 <2e-16 ***
## hhsize 1.216e-01 7.900e-03 15.39 <2e-16 ***
## internetYes 4.080e-01 2.414e-02 16.90 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5941 on 5128 degrees of freedom
## Multiple R-squared: 0.5883, Adjusted R-squared: 0.588
## F-statistic: 2442 on 3 and 5128 DF, p-value: < 2.2e-16
# We will leverage the function we created earlier to test whether the linear model assumptions are violated
assumption_test(model_bc)
Figure 6 - Residuals vs Fitted and Normal QQ Plot for the BoxCox transformed model
Even though the variance is still slightly inconsistent, it generally looks acceptable.
Based on the above model, let’s find the train RMSE and the test RMSE. The test RMSE tells us how well the model performs on unseen data.
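Because the response is modelled as \(y^\lambda\), each prediction \(\hat g\) on the transformed scale is back-transformed before the error is computed:

\[\hat y = \hat g^{\,1/\lambda}, \qquad \text{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat y_i\right)^2}\]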
# training and testing datasets
set.seed(420)
# Splitting the lcf dataset into train & test: 80% of the data goes to train, 20% goes to test.
lcf_trn_idx = sample(nrow(lcf_data), size = trunc(0.80 * nrow(lcf_data)))
lcf_trn_data = lcf_data[lcf_trn_idx, ]
lcf_tst_data = lcf_data[-lcf_trn_idx, ]
# Building the model using only the training data
model_train = lm(expenditure^lambda ~ income + hhsize + internet, data = lcf_trn_data)
# Train RMSE - prediction on the training data using the model built on the training dataset
(train_rmse = sqrt(mean((predict(model_train)^(1/lambda) - lcf_trn_data$expenditure)^2)))
## [1] 204.5318
# Test RMSE - prediction on the test dataset, which was not used to fit the model.
# This is a good assessment of how the model will perform on unseen data
(test_rmse =sqrt(mean((predict(model_train,newdata = lcf_tst_data)^(1/lambda) - lcf_tst_data$expenditure)^2)))
## [1] 195.5008
Start with a large model,
large_model = lm(expenditure ~ I(income ^ 2) + I(hhsize ^ 2) + (income + hhsize + internet + economicHRP + SEC3Class + tenureType + sex_HRP)^2, data = lcf_trn_data)
Performing backward AIC on the model,
large_model_back_aic = step(large_model, direction = "backward", trace = 0)
and backward BIC on the model,
large_model_back_bic = step(large_model, direction = "backward", trace = 0, k = log(length(resid(large_model))))
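Setting `k = log(n)` in `step()` replaces AIC’s penalty of 2 per parameter with BIC’s \(\log n\); for a linear model with residual sum of squares RSS and \(p\) parameters (up to an additive constant):

\[\text{AIC} = n\log\left(\frac{\text{RSS}}{n}\right) + 2p, \qquad \text{BIC} = n\log\left(\frac{\text{RSS}}{n}\right) + p\log n\]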
The characteristics of the models are
| Characteristic | Large Model | Model from backward AIC | Model from backward BIC |
|---|---|---|---|
| Number of predictors | 84 | 37 | 12 |
| Adjusted \(R^2\) | 0.5421597 | 0.5431252 | 0.5346943 |
| LOOCV RMSE | 199.0421315 | 198.0560146 | 199.3892669 |
| Shapiro-Wilk Test | Reject | Reject | Reject |
| BP Test | Reject | Reject | Reject |
The model obtained from the backward AIC search has the lowest LOOCV RMSE and the highest Adjusted \(R^2\). However, it did not pass the Shapiro-Wilk and BP tests. The Fitted vs Residuals plot and the Normal QQ plot (Figure 7) were then examined for this model.
assumption_test(large_model_back_aic)
Figure 7 - Residuals vs Fitted and Normal QQ Plot for the model obtained from backward AIC search
Figure 7 shows that the constant variance assumption is violated since there is not an even spread of the residuals about the zero line. The plot also shows that the normality assumption is violated.
These results indicate that a transformation of the response is required. Performing a BoxCox transformation on the model,
bc = boxcox(large_model_back_aic, plotit = TRUE, lambda = seq(0, 0.3, by = 0.1))
Figure 8
inter_lambda = bc$x[which.max(bc$y)]
indicates that the value of \(\lambda\) in \(\frac{{y^\lambda}-1}{\lambda}\) is 0.1939394. Applying this transformation to the large model,
large_model_trans = lm((((expenditure ^ inter_lambda)-1)/inter_lambda) ~ I(income ^ 2) + I(hhsize ^ 2) + (income + hhsize + internet + economicHRP + SEC3Class + tenureType + sex_HRP)^2, data = lcf_trn_data)
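Predictions from this model are on the transformed scale; they are mapped back to expenditure by inverting \(g_\lambda\):

\[\hat y = \left(\lambda\,\hat g + 1\right)^{1/\lambda}\]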
and performing a backward search using AIC,
large_model_trans_back_aic = step(large_model_trans, direction = "backward", trace = 0)
and a backward search using BIC,
large_model_trans_back_bic = step(large_model_trans, direction = "backward", trace = 0, k = log(length(resid(large_model_trans))))
yields the following results
| Characteristic | Large Model | Model from backward AIC | Model from backward BIC |
|---|---|---|---|
| Number of predictors | 84 | 40 | 19 |
| Adjusted \(R^2\) | 0.6177774 | 0.618087 | 0.6127457 |
| LOOCV RMSE | 1.3266462 | 1.3187038 | 1.3244992 |
| Shapiro-Wilk Test | Reject | Reject | Reject |
| BP Test | Reject | Reject | Reject |
The LOOCV RMSE values here are on the transformed response scale, so they are not directly comparable to those obtained before the transformation, but the Adjusted \(R^2\) values have increased. The Shapiro-Wilk and BP tests are still not passed by these models, however.
The model chosen by the backward BIC search is smaller and has LOOCV RMSE and Adjusted \(R^2\) values similar to those of the backward AIC model, so it will be used. The Fitted vs Residuals plot and the Normal QQ plot (Figure 9) were then examined for this model.
assumption_test(large_model_trans_back_bic)
Figure 9 - Residuals vs Fitted and Normal QQ Plot for the model obtained from backward BIC search
The plots show that the model assumptions are not violated. This model was then tested to see how well it predicts expenditure on the test dataset.
large_model_trans_back_bic_prediction = ((inter_lambda * predict(large_model_trans_back_bic, newdata = lcf_tst_data))+1)^(1/inter_lambda)
large_model_trans_back_bic_prediction_diff = (lcf_tst_data[,"expenditure"]-large_model_trans_back_bic_prediction)/lcf_tst_data[,"expenditure"]
The differences between the expenditure values predicted by the model and the actual values are shown in Figure 10.
Figure 10 - difference between predicted and actual values
The plot shows that the predicted values are very close to the actual values, so this model is a good candidate for explaining the relationship among these variables.
The RMSE for the additive model and the interaction model are
| Model | RMSE on training data | RMSE on test data |
|---|---|---|
| Additive | 204.5318118 | 195.5007985 |
| Interaction | 552.7602248 | 195.094629 |
1 University of Manchester, Cathie Marsh Institute for Social Research (CMIST), UK Data Service, Office for National Statistics. (2019). Living Costs and Food Survey, 2013: Unrestricted Access Teaching Dataset. [data collection]. 2nd Edition. Office for National Statistics, [original data producer(s)]. Office for National Statistics. SN: 7932, http://doi.org/10.5255/UKDA-SN-7932-2
2 Organisation for Economic Co-operation and Development. “Household Accounts - Household Spending”. oecd.org. https://data.oecd.org/hha/household-spending.htm (Accessed July 21st 2019).
3 Ferguson, B.D., Tandon, A., Gakidou, E. & Murray, C.J.L. (2010). Estimating Permanent Income Using Indicator Variables. In Health Systems Performance Assessment: Debates, Methods and Empiricism.